To date, the majority of computerized adaptive 
testing (CAT) systems for achievement and aptitude testing have been 
based on the dichotomous item response models. However, current 
research with polychotomous model-based CATs is yielding promising 
results. This study extends previous work on nominal response 
model-based CAT (NR CAT) and compares its ability estimation as well 
as its overall performance to graded response model-based CAT (GR 
CAT). A data set of 275 examinees was used, derived from five 
administrations of the College Board's Achievement Test in 
Mathematics, Level I, at the University of Texas, Austin. Results 
show that both CATs had high convergence rates despite using a small 
item pool and had average test lengths slightly below 16 items. The 
NR CAT's ability estimates were highly correlated with and not 
significantly different from an external criterion and showed no 
systematic bias in estimating ability throughout the trait continuum. 
In contrast, the GR CAT had a tendency to underestimate high ability 
examinees, although its ability estimates were highly associated with 
the external criterion. Educational implications of the findings 
include the possibility of merging computer-aided instruction and 
diagnostic testing with CAT. Six tables and nine graphs are included. 
(Author/TJH) 
A comparison of the nominal and graded response models
in computerized testing

R.J. De Ayala
University of Maryland
Barbara Dodd & William R. Koch
University of Texas

ABSTRACT

To date, the majority of computerized adaptive testing
(CAT) systems for achievement and aptitude testing have been based
on the dichotomous item response models. However, current research
with polychotomous model-based CATs is yielding promising results.
This study extends previous work on a nominal response model-based
CAT (NR CAT) and compares its ability estimation as well as its
overall performance to a graded response model-based CAT (GR CAT).
Results showed that both CATs had high convergence rates despite
using a small item pool and had average test lengths slightly below
16 items. The NR CAT's ability estimates were highly correlated
with and not significantly different from an external criterion and
showed no systematic bias in estimating ability throughout the
trait continuum. In contrast, the GR CAT had a tendency to
underestimate high ability examinees, although its ability estimates
were highly associated with the external criterion. Implications
of NR and GR model-based CATs are discussed.

Research on the use of polychotomous models in computerized
adaptive testing (CAT) has shown promising results (e.g.,
Koch & Dodd, 1985; De Ayala & Koch, 1987; Sympson, 1986). There
are a number of benefits which may be accrued through the use of
polychotomous models in CAT. For instance, a new domain of test
items which do not require transforming the examinee's actual
responses into dichotomous responses (e.g., correct and incorrect)
may be administered in an adaptive fashion; the term scoring will
refer to this transformation. These items may be either attitude
items or test items specifically developed for administration by a
computer (i.e., "computerized" items; currently "paper-and-pencil"
items are used in CAT).

Two polychotomous models appropriate for attitude as well
as aptitude and achievement testing are Samejima's (1969) graded
response (GR) model and Bock's (1972) nominal response (NR) model.
Both models share an indirect relationship. Specifically, the GR
model is a direct extension of the two-parameter model, whereas the
NR model reduces to the two-parameter model when an item only has
two categories (i.e., correct and incorrect).

Samejima (1969) extended the dichotomously scored two-parameter
model to the case of polychotomously scored items with
ordered responses. For the GR model the examinee responses to item
i are categorized into $m_i + 1$ categories, where higher categories
indicate more of $\theta$. Associated with each category of item i is a
category score, $x_i$, with values $0, \ldots, m_i$. The GR model may be
expressed as:

$$P_{x_i}(\theta) = \frac{e^{D a_i (\theta - b_{x_i})}}{1 + e^{D a_i (\theta - b_{x_i})}} \qquad (1)$$

where D is a scaling constant, $\theta$ is the latent trait, $a_i$ is the
discrimination parameter for item i, and $b_{x_i}$ is the difficulty
parameter for category score x for item i. $P_{x_i}$ is the probability,
p, of the examinee responding in category score $x_i$ or higher for
a given item; the p of responding in the lowest category (i.e.,
$P_0(\theta)$) or higher is 1.0. For instance, for an item with four
response categories $P_2(\theta)$ is the probability of responding in
categories 2 or 3 rather than in categories 0 or 1. Because $P_{x_i}$ is the
probability of responding in $x_i$ or higher, the p of responding in a
particular category equals the difference between the Ps for adjacent
categories.
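
The relationship between Equation 1 and the category probabilities can
be sketched in a few lines of Python. This is only an illustration: the
function names, the value D = 1.7, and the example parameters are
assumptions and do not come from the study.

```python
import math


def gr_boundary_probs(theta, a, bs, D=1.7):
    """Probabilities of responding in category x or higher, x = 0..m.

    bs holds the category difficulties b_1..b_m in increasing order.
    D = 1.7 is an assumed scaling constant (hypothetical value).
    """
    probs = [1.0]  # responding in category 0 or higher is certain
    for b in bs:
        z = D * a * (theta - b)
        probs.append(1.0 / (1.0 + math.exp(-z)))  # same as e^z / (1 + e^z)
    return probs


def gr_category_probs(theta, a, bs, D=1.7):
    """Probability of each single category: difference of adjacent boundaries."""
    cum = gr_boundary_probs(theta, a, bs, D) + [0.0]  # nothing lies above m
    return [cum[x] - cum[x + 1] for x in range(len(bs) + 1)]


if __name__ == "__main__":
    # Hypothetical four-category item (category scores 0..3).
    print(gr_category_probs(theta=0.0, a=1.0, bs=[-1.0, 0.0, 1.0]))
```

The differencing step only yields non-negative category probabilities
when the difficulties are ordered, which is one way to see why the GR
model needs ordered alternatives.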

For the GR model item responses must be ordered a priori in
order to fit this model. Therefore, Likert scale items or
aptitude/achievement test items whose alternatives have been ordered
according to knowledge of the correct answer are appropriate.

Samejima (1977) showed that using the GR model in a CAT
resulted in the administration of approximately 25% fewer items
than a CAT using a dichotomous model. Further, Samejima found that
the GR CAT "presenting" items with lower as was as efficient as the
binary CAT which administered items with larger as (e.g., a values
of 1.0 and 2.0, respectively).

In contrast to the GR model, Bock's NR model assumes that
item alternatives represent responses measured at a nominal level
of measurement. Bock's NR model provides a description of the
probability of a response to each category as a function of two
parameters characteristic of the particular category as well as an
ability parameter. These category probabilities for an item are
conditional on $\theta$ and are constrained to sum to 1.0. The NR model
is expressed as:

$$p(u_i = x \mid \theta; a; c) = \frac{e^{a_x \theta + c_x}}{\sum_{k=1}^{m_i} e^{a_k \theta + c_k}} \qquad (2)$$

where $p(\theta)$ is the probability that a subject with ability level $\theta$
will provide an item response, $u_i = x$, to item i with $m_i$ categories.
The item parameters, $a_k$ and $c_k$, are associated with the kth
category of item i. Specifically, $a_k$ is the slope parameter and $c_k$
is the intercept parameter of the nonlinear response function
associated with the kth category of an $m_i$ category item.
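
Equation 2 can be illustrated with a short Python sketch. The function
name and the example parameter values are hypothetical; subtracting the
largest exponent is a numerical-stability choice made here, not
something described in the paper.

```python
import math


def nr_category_probs(theta, a, c):
    """Nominal response model: probability of each of the m categories.

    a and c are the slope and intercept parameters, one pair per category.
    """
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    z_max = max(z)  # subtracting the maximum leaves the ratios unchanged
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


if __name__ == "__main__":
    # Hypothetical four-category item.
    print(nr_category_probs(0.5, a=[-0.8, -0.2, 0.1, 0.9], c=[0.0, 0.3, 0.2, -0.5]))
```

With only two categories the second probability equals a logistic
function of $(a_2 - a_1)\theta + (c_2 - c_1)$, which is the sense in
which the NR model reduces to the two-parameter model.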

Bock (1972), Thissen (1976), and De Ayala and Koch (1987)
have all shown that the NR model provides more information than a
dichotomous model, particularly in the lower half of the ability
distribution. This information can be used by a NR model-based CAT
to provide more precise $\hat{\theta}$s than a three-parameter logistic (3PL)
model-based CAT in the lower half of the ability range (De Ayala, Dodd &
Koch, in press). Furthermore, the NR CAT was found to have a
higher convergence rate than the 3PL CAT while administering
approximately the same number of items. Both the 3PL and NR CATs'
ability estimates were highly correlated with and not significantly
different from an external criterion.

Both the GR and NR models have demonstrated the ability to
utilize the information in examinees' incorrect responses for
ability estimation. However, the models differ with respect to
their implicit assumptions about the level of measurement inherent
in the item's alternatives and in the number of parameters required
to describe an item. The NR model requires more parameters to
describe the examinee-item interaction than does the GR model, but
does not have the GR requirement that the item's alternatives be
ordered. Therefore, there is a trade-off of advantages and
disadvantages between the two models. This study investigates which
model may be preferable for CAT.
METHOD 

Data: A data set of 1093 examinees was created from five
administrations of The College Board's Achievement Test in Mathematics,
Level I,¹ at the University of Texas at Austin. This data set
contained only individuals who answered at least 80% of the 50 item
test and the last question. Each CAT program simulated an adaptive
test for each examinee in the data base. For both CATs the
unscored responses were used.

¹ The authors wish to thank The College Board for granting
permission to use the Mathematics Level I Achievement test data.

To improve item parameter estimation and to work within the
constraints of the calibration program, MULTILOG 5 (Thissen, 1986),
the 5-choice items of the Math Level I test were collapsed into
4-choice items (a.k.a. the Collapsed data set). For the GR model's
item calibration it was necessary to order the item alternatives.
The ordinal relationship among the alternatives for an item was
obtained by arranging the options according to the mean number
correct score of examinees selecting each option. That is, the
alternative with the largest mean number correct score was
considered to reflect more of the trait (i.e., the optimal response)
than the alternative with the second largest mean number correct
score, etc. In general, the optimal response for an item was also
its correct response. MULTILOG 5 was used for the NR and GR
models' item calibration of the Collapsed data.
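
The ordering of alternatives described above can be sketched as
follows. This is a minimal illustration with made-up data; the function
name and the option labels are hypothetical, and the study's actual
code is not available.

```python
from collections import defaultdict


def order_alternatives(choices, nc_scores):
    """Rank one item's options by the mean number-correct score of the
    examinees who selected each option.

    choices:   option chosen by each examinee for this item
    nc_scores: number-correct test score of the same examinees
    Returns a dict mapping option -> ordered category (0 = lowest mean).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for option, nc in zip(choices, nc_scores):
        sums[option] += nc
        counts[option] += 1
    means = {option: sums[option] / counts[option] for option in sums}
    ranked = sorted(means, key=means.get)  # ascending mean NC score
    return {option: rank for rank, option in enumerate(ranked)}


if __name__ == "__main__":
    # Made-up responses of six examinees to one four-option item.
    choices = ["A", "B", "A", "C", "B", "D"]
    nc_scores = [40, 20, 44, 10, 24, 30]
    print(order_alternatives(choices, nc_scores))
```

The option with the highest category value plays the role of the
optimal response; ties between option means are not handled specially
in this sketch.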

Because the CAT simulations required complete response
strings, a new data set containing only those Collapsed data set
examinees with no non-responses was created. Of 1093 examinees
used for item calibration, 275 examinees were found to have
answered all items; this data set is called the Complete data set.
An ordered alternative version of the Complete data was used with the
GR CAT.

Programs: A CAT program based on the GR model (called GR
CAT) and another program based on the NR model (called the NR CAT)
were written. Both programs used maximum likelihood estimation
(MLE) of ability and maximum item information for item selection.
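
Maximum-information item selection can be illustrated for the NR model
with the sketch below. It assumes the standard Fisher information of
the nominal model (the variance of the category slopes under the
category probabilities); the paper does not print this formula, and the
function names and item pool are hypothetical.

```python
import math


def nr_probs(theta, a, c):
    """Nominal response model category probabilities."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


def nr_item_information(theta, a, c):
    """Item information: variance of the slopes a_k weighted by P_k(theta)."""
    p = nr_probs(theta, a, c)
    mean_a = sum(a_k * p_k for a_k, p_k in zip(a, p))
    return sum(p_k * (a_k - mean_a) ** 2 for a_k, p_k in zip(a, p))


def select_item(theta_hat, pool, administered):
    """Return the id of the unused item with the most information at theta_hat.

    pool maps item id -> (a, c); administered is a set of used item ids.
    """
    candidates = [item for item in pool if item not in administered]
    return max(candidates, key=lambda item: nr_item_information(theta_hat, *pool[item]))


if __name__ == "__main__":
    # Hypothetical two-item pool.
    pool = {1: ([0.0, 0.5], [0.0, 0.0]), 2: ([0.0, 2.0], [0.0, 0.0])}
    print(select_item(0.0, pool, administered=set()))
```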

The adaptive testing simulation was terminated when either
of two criteria was met: a maximum of thirty items was reached or
a predetermined standard error of estimate (SEE) was attained.
A previous study (Koch, De Ayala, & Dodd, 1988) demonstrated that
the use of maximum SEE as a termination criterion yielded results
which were preferable to those using minimum item information as a
termination criterion. The maximum SEEs were determined
empirically from simulation results using small samples. Specifically,
the SEE value which minimized the average number of items
administered while simultaneously resulting in a large linear association
between the ability estimates and an external criterion was
selected for use in this study. A value of 0.45 was used as the
maximum SEE value for the NR CAT and the GR CAT's maximum SEE was
set to 0.44. Further, for both CATs the initial ability estimate
for an examinee was assumed to be equal to the population mean
(i.e., $\theta$ = 0.0).
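
The two termination criteria can be sketched as a single check. The
sketch assumes that the SEE is computed as the reciprocal square root
of the accumulated item information at the current estimate, which is
the usual choice with MLE but is not stated in the paper; the function
name is hypothetical.

```python
import math


def should_stop(item_infos, max_see, max_items=30):
    """Decide whether to end the adaptive test.

    item_infos: information of each administered item at the current
                ability estimate (assumed inputs)
    max_see:    SEE threshold, e.g. 0.45 (NR CAT) or 0.44 (GR CAT)
    """
    if len(item_infos) >= max_items:
        return True  # test-length criterion
    total_info = sum(item_infos)
    if total_info <= 0.0:
        return False  # SEE is undefined without information
    see = 1.0 / math.sqrt(total_info)  # assumed SEE formula
    return see <= max_see


if __name__ == "__main__":
    print(should_stop([0.5] * 10, max_see=0.45))
```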

In the GR CAT a variable stepsize was added (after a
response of 3 or more) or subtracted (after a response less than 3)
from the previous $\hat{\theta}$ in order to provide a new $\hat{\theta}$ when MLE could
not be performed (Koch, De Ayala, & Dodd, 1988). This variable
stepsize procedure was used until MLE could be performed.

Because there are no correct or incorrect responses with
the NR CAT, the above technique was modified based on the fact that
the probability of responding in a given category varies as a
function of $\theta$. Specifically, until MLE could be performed,
directional information for the addition or subtraction of the fixed
stepsize was obtained by: (a) calculating the probability of
responding in the category the examinee chose for the range ±0.5 $\theta$
units about the previous $\hat{\theta}$; and (b) the sign of the $\theta - \hat{\theta}$
associated with the largest probability in this $\theta$ range. If the
sign from (b) was positive then a stepsize of 0.20 was added
to the previous $\hat{\theta}$, otherwise the stepsize was subtracted from the
previous $\hat{\theta}$. This stepsize came from analysis of small sample
simulation runs with the NR CAT presented above.
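
Steps (a) and (b) can be sketched as below. The grid of 21 evaluation
points over the ±0.5 range is an assumption, since the paper does not
say how the range was searched, and the function names are
hypothetical.

```python
import math


def nr_probs(theta, a, c):
    """Nominal response model category probabilities."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


def nr_step(theta_hat, chosen, a, c, step=0.20, half_width=0.5, n_grid=21):
    """Move theta_hat by a fixed step toward the theta that makes the
    chosen category most probable within theta_hat +/- half_width."""
    best_theta, best_p = theta_hat, -1.0
    for j in range(n_grid):
        theta = theta_hat - half_width + 2.0 * half_width * j / (n_grid - 1)
        p = nr_probs(theta, a, c)[chosen]  # step (a)
        if p > best_p:
            best_theta, best_p = theta, p
    if best_theta - theta_hat > 0:  # step (b): sign of theta - theta_hat
        return theta_hat + step
    return theta_hat - step


if __name__ == "__main__":
    # Hypothetical two-category item; category 1 becomes likelier as theta rises.
    print(nr_step(0.0, chosen=1, a=[0.0, 1.0], c=[0.0, 0.0]))
```

When the maximum falls exactly at the previous estimate the sign is not
positive, so the sketch subtracts the step, following the "otherwise"
branch of the description.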

The CAT simulations were analyzed with respect to their
nonconvergent and convergent cases as well as those cases which
were convergent for both CATs (referred to as jointly convergent
cases). In addition to descriptive statistics on $\hat{\theta}$s, SEE, and
number of items administered (NIA), the appropriate test
characteristic curve for the item pool was used to convert the $\hat{\theta}$s from each
program to a number correct score (NC estimate). This NC estimate
was compared with the actual number correct score (NC empirical)
the examinee received on the exam. It was felt that NC empirical
was an unbiased criterion for evaluating performance of the two
different models. Unless otherwise stated all correlation
coefficients are Pearson Product Moment correlation coefficients.
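
The conversion of an ability estimate to an NC estimate can be sketched
for the NR model as the sum, over all items in the pool, of the
probability of the keyed (correct) category. Whether the study's test
characteristic curve was built exactly this way is an assumption, and
the function names and item parameters are hypothetical.

```python
import math


def nr_probs(theta, a, c):
    """Nominal response model category probabilities."""
    z = [a_k * theta + c_k for a_k, c_k in zip(a, c)]
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [v / total for v in expz]


def nc_estimate(theta_hat, items):
    """Expected number-correct score at theta_hat.

    items: list of (a, c, key), where key is the index of the correct
           category of the item.
    """
    return sum(nr_probs(theta_hat, a, c)[key] for a, c, key in items)


if __name__ == "__main__":
    # Hypothetical three-item pool.
    items = [
        ([0.0, 1.0], [0.0, 0.0], 1),
        ([0.0, 1.0], [0.0, 0.0], 1),
        ([-0.5, 0.2, 0.9], [0.1, 0.0, -0.4], 2),
    ]
    print(nc_estimate(0.0, items))
```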
RESULTS 

Tables 1 and 2 present descriptive statistics and factor
analyses of the two data sets, Collapsed and Complete. For each
data set, Table 1 shows the mean NC empirical and its standard
deviation (S.D.) as well as coefficient alpha (alpha). In order to
get an indication of the dimensionality of the data a linear factor
analysis with phi coefficients was performed. As can be seen from
these tables, the elimination of examinees with incomplete response
strings from the Collapsed data set did not meaningfully distort
the examinee information in the Complete data set. Further,
although neither data set had a first factor which accounted for a
large percentage of the total variance, each data set's first
factor did account for a large proportion of the common variance.
It was concluded that the data sets did not seriously violate the
unidimensionality assumption.

Insert Table 1 about here 



Insert Table 2 about here 

Table 3 presents the results of the GR and NR models'
calibration of the Collapsed data set. As can be seen from Table 3,
the $\hat{\theta}$s from each model were, as would be expected, highly
correlated with one another (r = 0.99; $r_{\text{Spearman}}$ = 0.99) and 98% of
the variance in one model's $\hat{\theta}$s was accounted for by
the other model's $\hat{\theta}$s. Figure 1 presents a scatterplot of the
calibration $\hat{\theta}$s from each model depicting the strong linear
relationship between the two variables and the agreement between the
models on their $\hat{\theta}$s for each examinee.

Insert Table 3 about here 



Insert Figure 1 about here 

Because the quality of the items' parameter estimates (at
least with respect to $\theta$) and an item's usefulness can be better
summarized by information than by summary statistics for each item
parameter, the items' information (Samejima, 1979) for each model's
parameter estimates was calculated. These item information
functions are summarized in the test information function (Birnbaum,
1968). Each model's respective test information function is
presented in Figure 2. As can be seen from this figure, the NR and GR
models' information functions are essentially identical and both
models' functions peak approximately at $\theta$ = -2.0. The GR model
provides slightly more information below $\theta$ = -2.25 than does the NR
model, whereas the NR model provides more information than the GR
model in the range -2.25 to 1.0.

Insert Figure 2 about here 

The results of the GR CAT simulation's convergent cases are
presented in Table 4. As can be seen, the GR CAT converged on all
275 cases. The average test length was 15.8 items (median = 16);
no simulated test reached the termination criterion of 30 items.
The average CAT NC estimate and the average NC empirical differed
by 2.8 (25.0 and 27.8, respectively) and by 2.4 with respect to
their median values (NC estimate median = 24.6, NC empirical median
= 27.0). With the large sample size it was not unexpected that the
matched t-test showed that the mean difference between NC empirical
and NC estimate was statistically significant (t = -10.48, df =
274, p = 0.0001); this significant finding resulted primarily from
the GR CAT's difficulty in estimating high ability examinees. In
fact, the NC estimates were not significantly different from NC
empiricals for NC empiricals below or equal to 25 (t = -1.83, df =
111, p = 0.07).

Insert Table 4 about here

A scattergram of NC estimate and NC empirical (Figure 3)
shows the GR CAT had a tendency to underestimate high ability
examinees, although the correlation coefficient between NC estimate
and NC empirical was 0.85. The coefficient of determination for
predicting NC empirical from NC estimate was 0.73; 73% of the
variability in NC empirical was accounted for by NC estimate. The
average GR CAT $\hat{\theta}$ of -0.43 (median = -0.52) was significantly
different from the calibration's average $\hat{\theta}$ of 0.02 (median = -0.14);
t = -12.38, df = 274, p = 0.0001.

Insert Figure 3 about here 

Figure 4 depicts the relationship of the difference between
the GR CAT's NC estimates and their empirical values as plotted
against NC empirical. As can be seen, for examinees with NC
empiricals greater than 25 the GR CAT had a strong tendency to
underestimate their ability. However, the GR CAT did not show this
bias towards examinees with NC empiricals below or equal to 25.
Given the GR model's total test information function (Figure 2),
which indicated that the model did not provide very much
information for this upper range of ability, this underestimation of high
ability examinees was not unexpected.

Insert Figure 4 about here 

The NR CAT's convergent cases results are summarized in
Table 5. The NR CAT converged on 97.8% of the total cases
(269/275). The NR CAT's average $\hat{\theta}$ was -0.07 (median = -0.20) with
an average test length of approximately 15.7 items (median = 14);
46 cases received adaptive tests of the maximum test length. The
average NC estimate from the NR CAT was 27.3 (median = 27.5) with
an average NC empirical of 27.4 (median = 27.0); the difference
between the mean NC estimate and mean NC empirical was not
statistically significant (t = -0.47, df = 268, p = 0.64). Further,
despite the large sample size the NR CAT's mean $\hat{\theta}$ was not
significantly different from the NR model's calibration average $\hat{\theta}$ for
these examinees (t = -0.63, df = 268, p = 0.53).

Insert Table 5 about here 

Figure 5 presents the scatterplot of NC estimate versus NC
empirical. As can be seen, the scattering of points about a straight
line is substantially less than found with the GR CAT's NC
estimate/NC empirical plot and there does not appear to be any
meaningful bias throughout the ability scale. The regression of NC
empirical upon NC estimate showed that over 85% of the variability in
NC empirical was "explained" by NC estimate; the correlation
coefficient between NC estimate and NC empirical was 0.93. Both these
values were substantially greater than those of the GR CAT.

Insert Figure 5 about here

A plot of the difference between NC estimate and NC
empirical versus NC empirical is presented in Figure 6. Except for
some cases of high ability where the NR CAT underestimated the
examinee's ability, there does not appear to be a systematic
tendency to over- or underestimate ability across the ability
continuum. Further, the sparsity of points in the upper ability
range, relative to the GR CAT, is an indicator that the NR CAT's
slight convergence problem exists in the upper ability region. All
6 nonconvergent cases were examinees whose NC empiricals were in the
range 46 to 49, where the NR model did not provide very much
information. As was the case for the GR CAT, there were a few cases
with NC empiricals between 20 and 30 which were underestimated by
about 10 points. Given the information available to the NR CAT in
the upper ability range it was surprising that the NR CAT performed
so well.



Insert Figure 6 about here 
To compare the two CATs directly the 269 jointly convergent
cases were analyzed. The summary statistics on the jointly
convergent cases for the GR CAT are presented in Table 6. Because the GR
CAT converged on 100% of the cases only its results will be altered
by examination of the jointly convergent cases. Therefore, Table 5
contains the NR CAT's results to be used for comparison with the GR
CAT.

As can be seen from Tables 5 and 6, the average and median
NIAs for the two CATs were almost equal, although the NR CAT had
substantially more variability in the number of items administered.
However, a comparison of the two CATs with respect to their average
NC estimate shows that the NR CAT performed substantially better in
estimating NC empirical; as stated above, the average NC empirical
and the NR CAT's average NC estimate were not significantly
different from one another. The strong positive correlation
coefficient of 0.93 between the NR CAT's NC estimate and NC empirical and
Figure 5 indicate that this nonsignificant result was not due to
large positive differences offsetting large negative differences.




Insert Table 6 about here 



Figure 7 depicts the relationship between the difference
between the GR CAT's NC estimate and NC empirical versus NC
empirical. As would be expected, the elimination of the NR CAT's
six nonconvergent cases did not have a meaningful impact on the GR
CAT's underestimation of high ability examinees. Furthermore,
matched t-tests of the difference between NC estimate and NC
empirical as well as the difference between the GR CAT's $\hat{\theta}$s and the GR
model's calibration $\hat{\theta}$s were statistically significant, t = -10.05
and t = 12.19 (for both tests: df = 268, p = 0.0001), respectively.

Insert Figure 7 about here 

DISCUSSION 

In this simulation study both CATs performed well despite a
small item pool (50 items). For instance, the two CATs had very
high convergence rates and provided ability estimates which were
highly correlated with an external criterion while administering,
on the average, fewer than 16 items. However, given the results
presented above (e.g., the matched t-test results, the plots of the
difference between the NC estimate and NC empirical versus NC
empirical) the NR CAT's performance was superior to the GR CAT's. The NR
CAT's ability estimates did not show the systematic underestimation
of high ability examinees which was prevalent in the GR CAT's
ability estimates. In addition, the NR CAT's NC estimates were highly
correlated with and predictive of NC empirical. In contrast to the
GR CAT, the NR CAT did have 17% of its convergent cases terminated
by the maximum test length criterion. However, the CAT had a
sufficiently large number of cases with small test lengths so that
its average as well as its median NIA was approximately equal to
that of the GR CAT.

Of particular interest was the similarity of the models'
total test information functions. The GR and NR models describe
the test similarly, with the two models showing differences from one
another only below $\theta$ = -1.0. The similarity of the information
functions in the upper range of the $\theta$ scale was not unexpected.
In this range both models are faced with a reduction in the
variation of responses from the higher ability examinees. As
ability increases, examinees make fewer and fewer incorrect
responses and the majority of the incorrect responses which do occur
are most likely accounted for by one or two incorrect alternatives
which are especially attractive to high ability examinees.
Therefore, both models are extracting information from what is
progressively becoming dichotomous-like data.

For $\theta$s below -1.0 there is a great deal of variability
in the examinees' responses and the differences between the two
models became apparent. It can be seen from Figure 2 that the GR
model extracts more information than the NR model for $\theta$s below
approximately -2.25. In contrast, the NR model provides more
information in the range -2.25 to 1.0 (approximately). Assuming a
N(0,1) distribution of ability, the NR model provides more
information than does the GR model for approximately 83% of the
examinees.
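
The 83% figure can be checked directly from the standard normal
distribution as the area between -2.25 and 1.0. The short Python check
below uses only the standard library; the helper name is hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal N(0,1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Proportion of a N(0,1) population with ability between -2.25 and 1.0.
share = normal_cdf(1.0) - normal_cdf(-2.25)

if __name__ == "__main__":
    print(round(share, 3))  # about 0.83
```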
This relationship between the two information functions may
result from the information provided to the GR model by the ordinal
relationship of the item's alternatives (i.e., the large response
values indicate higher ability than lower response values). The GR
model can use this additional information for estimating the item
and examinee parameters whereas this information is not available
for the NR model's estimation. This ordered alternative
information may, to a certain extent, offset the influence of guessing on
the item parameter estimates (e.g., by reducing the effect of
guessing on the estimation of the discrimination parameter) and thereby
increase the information available for low ability examinees. If
this is true, then a NR-like model with a pseudo-guessing parameter
(e.g., Thissen & Steinberg's multiple-choice model (1984), or its
equivalent, Sympson's Model III (1983)) should subsume the GR
model's information function in the lower range of $\theta$.
Alternatively, the GR/NR difference below $\theta$ = -2.25 may be an artifact
reflecting the inaccuracy of estimating the information function at
these extreme levels of ability where there are comparatively fewer
observations than around $\theta$ = 0.0.

Given the similarity of the information functions in the
upper half of the ability scale it is surprising that the two CATs
would perform differently in this range of the $\theta$ continuum. The GR
CAT consistently underestimated examinees with NC empiricals
greater than 30. Recall that the longest GR CAT simulated test was
21 items; therefore, the maximum SEE value was not set too low for
these high ability examinees. An additional simulation (N = 275) of
the GR CAT with the maximum SEE set to 0.34 was performed.
Analysis of these data showed an increase in test length of, on the
average, 10 items and did not alleviate the GR CAT's problem with
underestimation of high ability examinees.

One potential explanation for the GR CAT's bias in
estimating high ability examinees may lie in the distribution of an
item's information and its interaction with the item selection
strategy. Samejima (1969) has shown that the closer the category
difficulty parameters are to one another the greater the
information for a narrow $\theta$ range (i.e., a leptokurtic item information
distribution). Conversely, the greater the distance between $b_x$ and
$b_{x+1}$ the broader the distribution of item information (i.e., the
information is distributed over a larger $\theta$ range at the expense of
becoming more platykurtic).

Inspection of the item information functions for items with
some positive bs (no item had all positive bs) showed that the
majority of these functions were relatively platykurtic. For
instance, Figure 8 presents item information functions for such a
set of items, specifically items 44 - 50. The items with
relatively flat information functions (items 44, 46 and 49) had large
differences between $b_1$ and $b_3$ (9.09, 6.41, and 7.90, respectively)
and, as would be expected, these items also had small as (0.23,
0.42, and 0.42, respectively). The comparatively peaked
information function (item 48) had a $b_1$/$b_3$ difference of 2.39 and an a of
1.13.



Insert Figure 8 about here 
Figure 9 presents the NR model's information functions for
these same items (items 44 - 50). Except for items 44 and 49,
[the NR model's functions provide more information over a narrower
range than the] corresponding functions based on the GR model. This
increase in information over a narrower range probably results from the NR
model's use of a parameter which indicates the alternative's, not
the item's, capability to discriminate among different abilities.
In contrast, the GR model uses an "average-like" (i.e., over all
alternatives) discrimination parameter for assessing the item's
capacity to differentiate among abilities.

Insert Figure 9 about here 

The implications of this approach become more apparent in
the upper ability range where fewer alternatives are
accounting for more of the examinee responses. Due to the decrease
in responses to the less attractive alternatives, these distractors
will not discriminate well among the examinees and their parameters
will not be well estimated. These poorly discriminating
alternatives will, in effect, attenuate the discrimination of the more
attractive alternatives. The item's a will reflect both the
alternatives which discriminate well and those that do not. The
attenuation effect of poorly discriminating alternatives will not
be as great in the NR model because each alternative's a will
assess the option's capacity to discriminate among examinees.
The alternative(s) which discriminate well will have large as,
whereas the poorly discriminating alternative(s) will have low a
values.

It is believed that, for the GR CAT, the proclivity of
items which provide the most information in the upper ability range
to have broad and/or platykurtic item information functions (i.e.,
low discriminatory power) results in the underestimation of high
ability examinees. That is, the initial administration of highly
discriminating items provides sufficient information to indicate
that the examinee's ability is substantially different from 0.0.
As a result, after the administration of a few items there have
been comparatively large changes in $\hat{\theta}$. However, the subsequent
administration of poorly discriminating items only slightly
increases $\hat{\theta}$ beyond the $\hat{\theta}$ prior to their administration. In
contrast to the GR CAT, for the upper half of the $\theta$ scale the NR
CAT has available items which are comparatively more discriminating
at certain $\theta$s than at other $\theta$s and therefore are more likely to be
selected for the $\hat{\theta}$ at which their information is most localized.
The administration by the NR CAT of these more discriminating
items results in the appropriate adjustment(s) of $\hat{\theta}$.

It is interesting to note that the similarity of the total
test information functions for the GR and NR models concealed the
differences between the two models in the way the item information
functions are distributed in the upper ability range.

The educational implications of polychotomous model-based
computerized adaptive testing include the possibility of merging
computer-aided instruction and diagnostic testing with CAT. It
should be noted that a polychotomous model-based CAT could use both
items which have to be scored as well as those that do not need to
be scored. However, the lack of an item scoring requirement for a
polychotomous model-based CAT would allow for computer-aided item
creation for the item pool, a complete IRT item analysis (Thissen &
Steinberg, 1984), and for the development and use of new and
innovative item formats. In this latter case, these polychotomous
items represent a new domain of items which may be used in adaptive
testing environments and which would be developed specifically for
polychotomous models.

Table 1 : Descriptive Statistics of the data sets 

Data Set       N     NC Mean    S.D.    Alpha 
Collapsed    1093     27.3      7.5     0.85 
Complete      275     27.8      8.5     0.88 

Table 2 : Principal Axes Analyses : Factors with eigenvalues 
greater than 1.0. 

             Variance Accounted 
             for by Factor I           Eigenvalues 
Data Set     Total     Common       I       II      III 
Collapsed    13.8%     74.7%      .902    1.299    1.036 
Complete     13.8%     74.7%      .902    1.299    1.035 

Table 3 : θs from the GR and Nominal Response models' 
calibrations of the Collapsed data set (N=1093) 

Model       Mean     SD      Min     Max    Median    r (1)   r (2) 
GR          0.12    0.87    -2.05    2.88    0.078 
Nominal     0.06    0.96    -2.17    3.05    0.017    0.99    0.99 

(1) Pearson Product Moment Correlation Coefficient between 
    Nominal Response and GR Models' Calibration θs 
(2) Spearman Rank-Order Correlation Coefficient between 
    Nominal Response and GR Models' Calibration θs 

Table 4 : GR CAT simulation's convergent cases (N = 275). 

                 Mean     SD      Min     Max     Mdn 
θ̂               -0.43    0.86    -2.83    1.53   -0.52 
SEE              0.43    0.01     0.37    0.44    0.43 
NIA              15.8    1.37     14.0    21.0    16.0 
NC estimate      25.0    6.60      7.6    38.3    24.6 
NC empirical     27.8    8.51     10.0    49.0    27.0 

r = 0.85 (Pearson Product Moment Correlation Coefficient between 
NC estimate and NC empirical) 

Table 5 : Nominal CAT simulation's convergent cases (N = 269) 

                 Mean     SD      Min     Max     Mdn 
θ̂               -0.07    1.28    -2.80     3.6   -0.20 
SEE              0.46    0.05     0.41    0.70    0.45 
NIA              15.7    7.82      7.0    30.0    14.0 
NC estimate      27.3    8.54      6.7    44.5    27.5 
NC empirical     27.4    8.07     10.0    46.0    27.0 

r = 0.93 (Pearson Product Moment Correlation Coefficient between 
NC estimate and NC empirical) 

Table 6 : GR CAT simulation - jointly convergent cases (N = 269). 

                 Mean     SD      Min     Max     Mdn 
θ̂               -0.46    0.84    -2.83    1.53   -0.53 
SEE              0.43    0.01     0.37    0.44    0.43 
NIA              15.8    1.34     14.0    21.0    15.0 
NC estimate      24.8    6.51      7.6    38.3    24.4 
NC empirical     27.4    8.07     10.0    46.0    27.0 

Pearson Product Moment Correlation Coefficient between 
NC estimate and NC empirical 

Figure 1 : GR and NR Calibration Thetas 
           (axis label: ... Model's Calibration Thetas) 
Figure 2 : Information Functions for GR & NR Models 
Figure 3 
Figure 4 : GR CAT's NC Estimate - NC Empirical 
Figure 5 : NR convergent cases : NC Estimate vs. NC Empirical 
Figure 6 : NR CAT's NC Estimate - NC Empirical 
Figure 7 : GR CAT's NC Estimate - NC Empirical, 
           jointly convergent cases 
Figure 8 : GR : Item Information Functions : Items 44 - 50 
Figure 9 : NR : Item Information Functions : Items 44 - 50 